First, let’s run some basic functions to get a picture of the dataset. It consists of 13 variables and 1,599 observations.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Since we are primarily interested in the quality of the red wine, let’s look at its summary statistics first. The quality scores in this dataset range from 3 to 8, with a median of 6.
According to the explanation in wineQualityInfo.txt, a wine’s score can range from 0 to 10. Since the scores in our data only range from 3 to 8, it is convenient to group the wines into three categories:
The plot above shows the number of wines in each of the three quality categories.
According to the warning message from the plot above, 132 non-finite values were removed. Let’s check:
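The grouping step itself is not shown in this output. A minimal sketch of one way to do it with base R’s `cut()` follows. The cut points (3-4, 5-6, 7-8) and the labels `bad`, `average` and `good` are assumptions made for illustration and are not necessarily the ones behind the plot above. The `quality` values are the first ten from the `str()` output.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed cut points: 3-4 -> bad, 5-6 -> average, 7-8 -> good.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
rating <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

# Count the wines in each rating group.
rating_counts <- table(rating)
print(rating_counts)
```

Making the result an ordered factor keeps the groups in worst-to-best order in later tables and plots.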
## [1] 132
It turns out that 132 wines have a citric acid value of zero. Since log10(0) is not finite, it is no surprise that the log10 plot above does not show these values.
The plot above shows the distribution of citric acid concentration. It is left-skewed after the log transformation, and almost all wines have a low concentration of citric acid (below 1 g/dm^3).
The plot above shows the concentration of alcohol among the wines. It is right-skewed after I log-transformed it.
The grid plots above show the distributions of residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates among the wines. Each is shown in two ways: before and after a log10 transformation of the x-axis. This makes the distribution of each feature easy to see.
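The check behind that number can be sketched as below. The vector holds only the first ten `citric.acid` values from the `str()` output, so the counts here are for that small sample. On the full data frame the same comparison should give the 132 reported above.

```r
# First ten citric.acid values, copied from the str() output above.
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# log10(0) evaluates to -Inf, which is not a finite value.
log_citric <- log10(citric.acid)

# The number of zeros equals the number of non-finite log10 values.
n_zero       <- sum(citric.acid == 0)
n_non_finite <- sum(!is.finite(log_citric))
print(c(zeros = n_zero, non_finite = n_non_finite))
```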
There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality).
(worst) ----> (best)
quality: 3, 4, 5, 6, 7, 8
Other observations:
The main feature in the data set is quality. I’d like to find out which features influence the quality of wines.
The variables related to acidity (fixed, volatile, citric, pH) may influence the taste of wines. Residual sugar, which indicates the sweetness of the wine, may also play an important role.
I created a rating variable to help with later visualizations. Also, since fixed.acidity, volatile.acidity and citric.acid all describe the acidity of the wine, I created a new variable called FVC (the FVC.acidity column), which adds up these three values. Here it is:
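The definition of the combined variable is not shown in this output. A minimal sketch, assuming the data frame is called `df` (as in the `lm()` call further down) and using the first five rows from the `str()` output:

```r
# First five rows of the three acidity columns, copied from the str() output.
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0, 0, 0.04, 0.56, 0)
)

# FVC.acidity is the sum of fixed, volatile and citric acidity.
df$FVC.acidity <- df$fixed.acidity + df$volatile.acidity + df$citric.acid
print(df)
```

Because fixed acidity is much larger than the other two components, the sum is dominated by it, which is consistent with the 0.996 correlation between fixed.acidity and FVC.acidity in the matrix below.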
Yes. I log-transformed the right skewed fixed.acidity, volatile.acidity, citric.acid and alcohol.
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1 Pearson Pearson
## volatile.acidity -0.2561 1 Pearson
## citric.acid 0.6717 -0.5525 1
## residual.sugar 0.1148 0.001918 0.1436
## chlorides 0.09371 0.0613 0.2038
## free.sulfur.dioxide -0.1538 -0.0105 -0.06098
## total.sulfur.dioxide -0.1132 0.07647 0.03553
## density 0.668 0.02203 0.3649
## pH -0.683 0.2349 -0.5419
## sulphates 0.183 -0.261 0.3128
## alcohol -0.06167 -0.2023 0.1099
## quality 0.1241 -0.3906 0.2264
## FVC.acidity 0.9964 -0.2044 0.6904
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity Pearson Pearson Pearson
## volatile.acidity Pearson Pearson Pearson
## citric.acid Pearson Pearson Pearson
## residual.sugar 1 Pearson Pearson
## chlorides 0.05561 1 Pearson
## free.sulfur.dioxide 0.187 0.005562 1
## total.sulfur.dioxide 0.203 0.0474 0.6677
## density 0.3553 0.2006 -0.02195
## pH -0.08565 -0.265 0.07038
## sulphates 0.005527 0.3713 0.05166
## alcohol 0.04208 -0.2211 -0.06941
## quality 0.01373 -0.1289 -0.05066
## FVC.acidity 0.1245 0.1167 -0.1536
## total.sulfur.dioxide density pH sulphates
## fixed.acidity Pearson Pearson Pearson Pearson
## volatile.acidity Pearson Pearson Pearson Pearson
## citric.acid Pearson Pearson Pearson Pearson
## residual.sugar Pearson Pearson Pearson Pearson
## chlorides Pearson Pearson Pearson Pearson
## free.sulfur.dioxide Pearson Pearson Pearson Pearson
## total.sulfur.dioxide 1 Pearson Pearson Pearson
## density 0.07127 1 Pearson Pearson
## pH -0.06649 -0.3417 1 Pearson
## sulphates 0.04295 0.1485 -0.1966 1
## alcohol -0.2057 -0.4962 0.2056 0.09359
## quality -0.1851 -0.1749 -0.05773 0.2514
## FVC.acidity -0.09628 0.6756 -0.6835 0.1816
## alcohol quality FVC.acidity
## fixed.acidity Pearson Pearson Pearson
## volatile.acidity Pearson Pearson Pearson
## citric.acid Pearson Pearson Pearson
## residual.sugar Pearson Pearson Pearson
## chlorides Pearson Pearson Pearson
## free.sulfur.dioxide Pearson Pearson Pearson
## total.sulfur.dioxide Pearson Pearson Pearson
## density Pearson Pearson Pearson
## pH Pearson Pearson Pearson
## sulphates Pearson Pearson Pearson
## alcohol 1 Pearson Pearson
## quality 0.4762 1 Pearson
## FVC.acidity -0.06667 0.1038 1
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## FVC.acidity 0.99638446 -0.204350914 0.69043814
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## FVC.acidity 0.124487746 0.116674670 -0.153614137
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## FVC.acidity -0.09627567 0.67559618 -0.68348382
## sulphates alcohol quality FVC.acidity
## fixed.acidity 0.183005664 -0.06166827 0.12405165 0.99638446
## volatile.acidity -0.260986685 -0.20228803 -0.39055778 -0.20435091
## citric.acid 0.312770044 0.10990325 0.22637251 0.69043814
## residual.sugar 0.005527121 0.04207544 0.01373164 0.12448775
## chlorides 0.371260481 -0.22114054 -0.12890656 0.11667467
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606 -0.15361414
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029 -0.09627567
## density 0.148506412 -0.49617977 -0.17491923 0.67559618
## pH -0.196647602 0.20563251 -0.05773139 -0.68348382
## sulphates 1.000000000 0.09359475 0.25139708 0.18160349
## alcohol 0.093594750 1.00000000 0.47616632 -0.06666786
## quality 0.251397079 0.47616632 1.00000000 0.10375373
## FVC.acidity 0.181603491 -0.06666786 0.10375373 1.00000000
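To read the matrix with quality in mind, it helps to sort the `quality` column by absolute value. The sketch below copies that column (rounded to four decimals) from the output above. On the full data frame, the same column could presumably be extracted with something like `cor(df)[, "quality"]`, which is an assumption about how the data are stored.

```r
# Correlations with quality, copied (rounded) from the matrix above.
cor_with_quality <- c(
  fixed.acidity        =  0.1241,
  volatile.acidity     = -0.3906,
  citric.acid          =  0.2264,
  residual.sugar       =  0.0137,
  chlorides            = -0.1289,
  free.sulfur.dioxide  = -0.0507,
  total.sulfur.dioxide = -0.1851,
  density              = -0.1749,
  pH                   = -0.0577,
  sulphates            =  0.2514,
  alcohol              =  0.4762,
  FVC.acidity          =  0.1038
)

# Rank the features by the strength (absolute value) of the correlation.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(ranked)
```

Alcohol, volatile acidity, sulphates and citric acid come out on top, so these are the features examined most closely in what follows.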
Exploring these plots, we can easily see that a ‘good’ wine generally has these trends:
Let’s examine how each acid concentration affects pH.
## Correlation: -0.7063602
## Correlation: 0.2231154
## Correlation: -0.7044435
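The code that produced these three lines is not part of this output. A minimal sketch of the calculation, correlating pH with the log10 of an acidity measure, is given below. The helper name `log_acid_cor` is hypothetical, and the data are only the first ten rows from the `str()` output, so the printed values will differ from the full-data correlations above.

```r
# First ten rows, copied from the str() output above.
df <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Hypothetical helper: Pearson correlation between log10(acid) and pH.
log_acid_cor <- function(acid, ph) cor(log10(acid), ph)

cor_fixed    <- log_acid_cor(df$fixed.acidity, df$pH)
cor_volatile <- log_acid_cor(df$volatile.acidity, df$pH)

cat("Correlation:", cor_fixed, "\n")
cat("Correlation:", cor_volatile, "\n")
```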
Because pH measures acid concentration (here FVC.acidity) on a log scale, it is not a surprise to find a strong correlation between pH and the log of the acid concentration. We can investigate this further with a linear model.
##
## Call:
## lm(formula = pH ~ log10(FVC.acidity), data = subset(df))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46869 -0.06359 -0.00024 0.06385 0.48539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.55344 0.03144 144.82 <2e-16 ***
## log10(FVC.acidity) -1.30534 0.03291 -39.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1096 on 1597 degrees of freedom
## Multiple R-squared: 0.4962, Adjusted R-squared: 0.4959
## F-statistic: 1573 on 1 and 1597 DF, p-value: < 2.2e-16
FVC.acidity explains only about half of the variance in pH, based on the R^2 value (0.496). According to the plot above, the mean error is relatively large for poor and excellent wines. So there must be other components that affect the pH too.
The boxplot above shows the relationship between sulphates and wine quality. Better wines seem to have a higher concentration of sulphates, though there are many outliers among the medium wines.
The relationship here is clear: as alcohol increases, the wine tends to have higher quality, especially among the high-end wines.
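As a worked example of what the fitted model says, the sketch below plugs the coefficients from the `lm()` summary into the model equation for the first wine in the data (fixed 7.4, volatile 0.70, citric 0, observed pH 3.51). The helper name `predict_ph` is hypothetical.

```r
# Coefficients copied from the lm() summary above.
intercept <- 4.55344
slope     <- -1.30534

# Hypothetical helper: predicted pH for a given total acidity (FVC.acidity).
predict_ph <- function(fvc) intercept + slope * log10(fvc)

# First wine in the data: total acidity 7.4 + 0.70 + 0, observed pH 3.51.
fvc_first   <- 7.4 + 0.70 + 0
ph_hat      <- predict_ph(fvc_first)
ph_residual <- 3.51 - ph_hat

cat("Predicted pH:", round(ph_hat, 3), "\n")
cat("Residual:    ", round(ph_residual, 3), "\n")
```

The residual for this wine is of the same order as the residual standard error of 0.1096 reported in the summary, which illustrates how much of the pH is left unexplained by total acidity alone.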
## Correlation: -0.4909483
The correlation between density and alcohol here makes sense, since the density of alcohol is lower than that of water. So more alcohol means lower density (the major component of wine is water), and the two features should be negatively correlated. That’s exactly the case, as shown in the plot.
First I looked at the correlations among the features, and then I explored the relationships between specific pairs of features further.

- pH vs. the three acidity features (fixed, volatile and citric) and their combined feature, FVC.acidity: It makes sense that higher acidity correlates negatively with pH. However, it is strange to find that volatile acidity correlates positively with pH, with a correlation of 0.223. It is also fair to say that fixed acidity plays the major role in influencing the pH of a wine.
- Sulphates vs. quality: Better wines seem to have a higher concentration of sulphates.
- Alcohol vs. quality: Better wines tend to have a higher concentration of alcohol.
The plot above indicates that the quality of a wine has little relationship with its density. However, good wines seem to have a high concentration of alcohol, as discovered in the last plot section.
The plot above indicates that the combination of high alcohol and high fixed acidity seems to produce better wines.
Clearly, lower volatile acidity and high alcohol can produce better wines. According to wineQualityInfo.txt downloaded from Udacity, volatile acidity is the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste. So it makes sense that good wines tend to have lower volatile acidity.
The plot above indicates that high alcohol and low pH are a good combination for a good wine.
It seems that for wines with high alcohol, higher sulphates tend to produce better wines. That’s interesting!
Good wines seem to have a combination of high concentration of alcohol and fixed acidity and lower pH.
Higher sulphates in wines tend to produce better wines for wines with high alcohol.
The majority of wines are of moderate quality. This is to be expected, since most consumers buy medium-quality wines when weighing price against quality (high quality means a high price).
The chart above reveals that alcohol has a strong influence on wine quality. This is especially true for good wines: wines with the highest quality (8) have about 12% alcohol by volume on average.
The plot divides the points into three panels based on the quality rating. Holding alcohol concentration constant, wines with higher sulphates almost always have better quality than wines with lower sulphates.
The wine quality data set contains information on 1599 wines across 13 variables. I started by understanding the individual variables in the data set based on the introduction in wineQualityInfo.txt, and then I explored interesting questions and leads as I continued to make observations on plots. I mainly focused on the features that may have an influence on the quality of wines.
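The per-quality averages behind a statement like this can be computed with `tapply()`. The sketch below uses only the first ten rows from the `str()` output, which contain no wines of quality 8, so it illustrates the calculation rather than reproducing the 12% figure.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean alcohol content for each quality score present in the sample.
mean_alcohol <- tapply(alcohol, quality, mean)
print(round(mean_alcohol, 2))
```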
During the investigation of these features, we found that a ‘good’ wine generally shows these trends:

1. higher fixed acidity (tartaric acid) and citric acid, and lower volatile acidity (quite surprising);
2. lower pH;
3. higher sulphates;
4. higher alcohol;
5. to a lesser extent, lower chlorides and lower density.

I was surprised that lower volatile acidity goes with better wines, but it made sense after I found that volatile acidity is the amount of acetic acid in wine, and too high a level of acetic acid can lead to an unpleasant, vinegar taste. The second surprise was that the correlation between volatile acidity (acetic acid) and pH was positive. That’s odd. Possibly it is because pH is not determined by volatile acidity alone: other components such as fixed acidity also play a vital role.
I also ran into some problems. When I explored the correlation between pH and citric acid, it seemed reasonable to compare the two on the same scale, so I took the log of citric acid. This caused a problem: some citric acid values are zero, and log10(0) is not finite. These values were lost, so I couldn’t compute the correlation between the two.
In the next stage of analysing the data set, I would like to improve my skill at choosing appropriate plots. It is also important to think about the question (the focus of the exploration) from different angles and to draw more precise conclusions.
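One possible workaround, which was not part of the original analysis, is to compute the correlation only over wines with a positive citric acid value. The sketch below shows the idea on the first ten rows from the `str()` output. Dropping the zeros changes which wines are included, so the result describes only wines that contain some citric acid.

```r
# First ten rows of citric.acid and pH, copied from the str() output above.
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
pH          <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# With zeros included, log10() yields -Inf and the correlation is not usable.
cor_all <- cor(log10(citric.acid), pH)

# Assumed workaround: keep only wines with a positive citric acid value
# before taking the log.
keep         <- citric.acid > 0
cor_positive <- cor(log10(citric.acid[keep]), pH[keep])

cat("All wines:           ", cor_all, "\n")
cat("Citric acid > 0 only:", cor_positive, "\n")
```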